ABSTRACT

Cancer is one of the leading causes of death in the USA and worldwide. There are several factors that might affect the cancer mortality rate, for example, incident rates, or the gender and age of a person. According to cancer.gov, we can study the trends of cancer mortality and incidence rate to see if there are better treatments in curing cancer in case there is a negative relationship between the factors. In this data analysis, we are interested to learn if there is a the correlation of cancer mortality rate, focusing on death causes by lung cancer in the USA, with different socioeconomics factors, such as age, income or poverty status.

INTRODUCTION

There are two big questions we want to learn in this data analysis.

First, we want to examine the relationship between the lung cancer mortality rate and the incidence rate. Through this analysis, we can examine the likelihood of dying if diagnosed with a lung cancer is consistent across counties.

Secondly, we want to learn if the socioeconomic factors affects the cancer mortality rates across the counties in the USA. The chosen factors are:
1. Median Age: According to cancer.gov, the median age to be diagnosed lung cancer is 70. With different median age across the counties, we are interested to know how age affect the incidence rate in each counties.
2. Income: Income plays an important role to determine standard of living, which affects the chance of getting cancer. We want to know if income have a positive or negative relationship with incidence rates.
3. Poverty status: Similar to income, poverty status affects the standard of living. We want to know if poverty status has a significant relationship to incidence rates.
4. Health Insurance: Early diagnosis and treatments might reduce the cancer mortality rate. We want to know that relationship holds for lung cancer.

DATA DESCRIPTION

The 2 .csv files given in the dataset 1: death.csv and incd.csv were not enough for data analysis. With further research, I have found more information and work on them to add more data. I have created 4 extra .csv files with data on AgeSex, Income, Insurance and Poverty. (R Code used for these data, respectively: Data-AgeSex.R, Data-Income.R, Data-Insurance.R, Data-Poverty.R)

Importing the data:

wd <- "/Users/PhuongAnh/STAT350-Project"
setwd(wd)
death = read.csv("death.csv")
incident = read.csv("incd.csv")
agesex = read.csv("agesex.csv")
income = read.csv("income.csv")
insurance = read.csv("insurance.csv")
poverty = read.csv("poverty.csv")

Merge by FIPS and Rename the Data.

FIPS: The 5 digits code with the first 2 number is State Code and 3 last numbers are the county code
County: Name of the county and state
Total.Population: Total population in the county
Total.Male: Total Male in the county
Total.Female: Total Female in the county
Median.Age: Median Age by county
Median.Age_Male: Male Median Age by county
Median.Age_Female: Female Median Age by county
Median.Income: Median Income by county
Insurance.Coverage: Number of people is covered by at least one type of health insurance
Poverty: Number of Poverty Status
Average.Deaths.per.Year: Average Count of death by county
Age.Adjusted.Death.Rate: Age Adjusted Death Rate per 100,000
Death.Rates.Trend: Death Rate Trends
Age.Adjusted.Incidence.Rate: Age Adjusted Incidence Rate per 100,000
Incidence.Rate.Trend: Incidence Rate Trend
Average.Annual.Count: Average Count of Incidence by county

agesex = agesex[c("Total.Population","Total.Male","Total.Female","Median.Age", "Median.Age_Male", "Median.Age_Female","FIPS")]
income = income[c("Median.Income", "FIPS")]
insurance = insurance[c("Insurance.Coverage","FIPS")]
poverty = poverty[c("Poverty","FIPS")]
data = merge(x= agesex, y=income, by="FIPS", all=TRUE)
data = merge(x= data, y=insurance, by="FIPS", all=TRUE)
data = merge(x= data, y=poverty, by="FIPS", all=TRUE)
death = death[c("County","FIPS","Age.Adjusted.Death.Rate","Average.Deaths.per.Year","Recent.5.Year.Trend..2..in.Death.Rates")]
incident = incident[c("FIPS","Age.Adjusted.Incidence.Rate...cases.per.100.000","Average.Annual.Count","Recent.5.Year.Trend.in.Incidence.Rates")]
ALLDATA = merge(x=death, y=incident, by="FIPS", all=TRUE)
ALLDATA = merge(x=ALLDATA, y=data, by="FIPS", all=TRUE)
ALLDATA = ALLDATA[-which(is.na(ALLDATA$County)),]
colnames(ALLDATA)[5] <- "Death.Rates.Trend"
colnames(ALLDATA)[6] <- "Age.Adjusted.Incidence.Rate"
colnames(ALLDATA)[8] <- "Incidence.Rate.Trend"

For FIPS = 0: USA, we will use the total of each column: Total Population, Total Male, Total Female, Insurance Coverage, Total Poverty

ALLDATA$Average.Deaths.per.Year <- as.numeric((ALLDATA$Average.Deaths.per.Year))
ALLDATA$Average.Annual.Count <- as.numeric((ALLDATA$Average.Annual.Count))
ALLDATA[1,9] = sum(ALLDATA[-1,9], na.rm=TRUE) #TOTAL POPULATION
ALLDATA[1,10] = sum(ALLDATA[-1,10], na.rm=TRUE) #TOTAL MALE
ALLDATA[1,11] = sum(ALLDATA[-1,11], na.rm=TRUE)  #TOTAL FEMALE
ALLDATA[1,16] = sum(ALLDATA[-1,16], na.rm=TRUE) #INSUARANCE COVERAGE
ALLDATA[1,17] = sum(ALLDATA[-1,17], na.rm=TRUE) #POVERTY

For each county, the population are varies. Therefore, it is easier to work with percentage.

Male Percent: Total Male/Total Population
Poverty_Percent: Total Poverty Status/Total Population
Insurance_Percent: Total People with Insuarance/Total Population
Death_Rate: Average Deaths per Year/Total Population
Incidence_Rate: Average Incidence per Year/Total Population

ALLDATA$Male_Percent = ALLDATA$Total.Male/ALLDATA$Total.Population
ALLDATA$Female_Percent = ALLDATA$Total.Female/ALLDATA$Total.Population
ALLDATA$Poverty_Percent = ALLDATA$Poverty/ALLDATA$Total.Population
ALLDATA$Insurance_Percent = ALLDATA$Insurance.Coverage/ALLDATA$Total.Population
ALLDATA$Death_Rate = ALLDATA$Average.Deaths.per.Year/ALLDATA$Total.Population
ALLDATA$Incidence_Rate = ALLDATA$Average.Annual.Count/ALLDATA$Total.Population

Looking at our data, there are 6 datapoints with N/A Death Rates. This is because there is no Population Data for these 6 counties. We are removing because these 6 points because mortality rate is the targeted variable.

CLEANDATA = ALLDATA[-which(is.na(ALLDATA$Death_Rate)),]
CLEANDATA$Average.Deaths.per.Year <- as.numeric((CLEANDATA$Average.Deaths.per.Year))

Remove FIPS = 15005 and 48301 because Incident_Rate >1. There might be data inconsistency.

CLEANDATA=CLEANDATA[CLEANDATA$FIPS!="15005",]
CLEANDATA=CLEANDATA[CLEANDATA$FIPS!="48301",]

Remove entries with Average Death per Year <16. Accoording to the dataset information, the county which has an "*" for age adjusted death rates are supressed. These data may cause bias to our model.

CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="1",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="2",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="3",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="4",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="5",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="6",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="7",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="8",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="9",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="10",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="11",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="12",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="13",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="14",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="15",]
CLEANDATA=CLEANDATA[CLEANDATA$Average.Deaths.per.Year!="16",]

Add a DATA POINT: Average of all columns, excluding the FIPS = 0 (USA Total). There are over 3000 counties with various population, I want to get a datapoint that is the mean of everything. If this data point became an outliers, that means the data is skewed.

library(dplyr)
CLEANDATA.noUS=CLEANDATA[-c(1),]
Mean = summarize_all(CLEANDATA.noUS,mean)
finaldata = rbind(CLEANDATA,Mean)

First look at Data: Create a pairs plot to see the relationship between variables of interest.

pairs(~Death_Rate+Median.Age+Median.Income+Insurance_Percent+Male_Percent+Incidence_Rate+Poverty_Percent, data=finaldata)

We can see that there are some linear relationship between some data points such as Insurance Percent and Male Percent, Death Rate and Incidence Rate.

METHODS

To analyze the data, we will use these methods:

  1. Linear Regression Model:

    • One Linear Regression Model for Death Rate and Incidence Rate
    • One Linear Regression Model for Death Rate and Other Factors.
      Using Linear Regression Model, we can determine the relationship between dependent variable (Mortality Rate) and Other Variables (Predictors). Through this analysis, we can determine the strength of each predictors on the dependent variable.
  2. Backward Elimination We will use backward elimination to choose an optimal model for Death Rate and Other Factors. There are a lot of Factors to be consider, with Backward Selection, we can try to fit all variable to the model and eliminate one by one the factor that is not meet the requirement. We will use ANOVA Table to get the F statistics for each regressor. Upon completing Backward Elimination, we weill determine which factors are most significant to our data.

  3. Multicollinearity We would like to see if the coefficients for each factors are independent of each other. We will compute Variance Inflation Factors for each factors to measure the combined effect of the linear dependencies among the regressors.

  4. Check the influence of possible influential observations We will use hii to compute the leverages for observations. We will use Cook’s Distance to find influential data.

  5. Residual Analysis With this analysis, we want to see if the linear model is appropriate, hence, the regressors are valid. We will perform Variance Stabilizing Transformation if necessary.

RESULTS

1. LINEAR REGRESSION MODEL

First, I want to examine the relationships between Incidence Rate and Death Rate.

Death.Incd.lm = lm(Death_Rate~Incidence_Rate, data=finaldata)
plot(Death.Incd.lm)

summary(Death.Incd.lm)
## 
## Call:
## lm(formula = Death_Rate ~ Incidence_Rate, data = finaldata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.012556 -0.002961 -0.002643 -0.001307  0.060309 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.0028604  0.0001777   16.09   <2e-16 ***
## Incidence_Rate 0.6447581  0.0110380   58.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007792 on 2693 degrees of freedom
## Multiple R-squared:  0.5589, Adjusted R-squared:  0.5587 
## F-statistic:  3412 on 1 and 2693 DF,  p-value: < 2.2e-16
anova(Death.Incd.lm)
## Analysis of Variance Table
## 
## Response: Death_Rate
##                  Df  Sum Sq  Mean Sq F value    Pr(>F)    
## Incidence_Rate    1 0.20718 0.207181    3412 < 2.2e-16 ***
## Residuals      2693 0.16352 0.000061                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear Regression Model for Mortality Rate and Other Factors.

Death.Factor.lm1 = lm(Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent+Poverty_Percent, data=finaldata)
plot(Death.Factor.lm1)

summary(Death.Factor.lm1)
## 
## Call:
## lm(formula = Death_Rate ~ Median.Age + Median.Income + Insurance_Percent + 
##     Male_Percent + Poverty_Percent, data = finaldata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.023195 -0.006646 -0.003743  0.002399  0.067951 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.004e-01  1.621e-02  -6.192 6.89e-10 ***
## Median.Age         6.113e-04  5.031e-05  12.152  < 2e-16 ***
## Median.Income     -2.194e-07  3.223e-08  -6.808 1.23e-11 ***
## Insurance_Percent  2.951e-02  8.796e-03   3.355 0.000805 ***
## Male_Percent       1.360e-01  1.654e-02   8.224 3.09e-16 ***
## Poverty_Percent   -1.346e-02  6.601e-03  -2.040 0.041496 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01094 on 2552 degrees of freedom
##   (137 observations deleted due to missingness)
## Multiple R-squared:  0.1544, Adjusted R-squared:  0.1528 
## F-statistic: 93.22 on 5 and 2552 DF,  p-value: < 2.2e-16
anova(Death.Factor.lm1)
## Analysis of Variance Table
## 
## Response: Death_Rate
##                     Df   Sum Sq  Mean Sq  F value    Pr(>F)    
## Median.Age           1 0.032909 0.032909 275.1743 < 2.2e-16 ***
## Median.Income        1 0.008848 0.008848  73.9891 < 2.2e-16 ***
## Insurance_Percent    1 0.004489 0.004489  37.5346 1.037e-09 ***
## Male_Percent         1 0.009000 0.009000  75.2533 < 2.2e-16 ***
## Poverty_Percent      1 0.000497 0.000497   4.1598    0.0415 *  
## Residuals         2552 0.305198 0.000120                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2. BACKWARD ELIMINATION REGRESSION

Elimate “Poverty_Percent” from the model

Death.Factor.lm2 = lm(Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent, data=finaldata)
plot(Death.Factor.lm2)

summary(Death.Factor.lm2)
## 
## Call:
## lm(formula = Death_Rate ~ Median.Age + Median.Income + Insurance_Percent + 
##     Male_Percent, data = finaldata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.023838 -0.006619 -0.003636  0.002128  0.067459 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.094e-01  1.560e-02  -7.011 3.01e-12 ***
## Median.Age         6.609e-04  4.409e-05  14.990  < 2e-16 ***
## Median.Income     -1.654e-07  1.838e-08  -9.002  < 2e-16 ***
## Insurance_Percent  2.904e-02  8.798e-03   3.301 0.000978 ***
## Male_Percent       1.415e-01  1.632e-02   8.670  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01094 on 2553 degrees of freedom
##   (137 observations deleted due to missingness)
## Multiple R-squared:  0.1531, Adjusted R-squared:  0.1517 
## F-statistic: 115.3 on 4 and 2553 DF,  p-value: < 2.2e-16
anova(Death.Factor.lm2)
## Analysis of Variance Table
## 
## Response: Death_Rate
##                     Df   Sum Sq  Mean Sq F value    Pr(>F)    
## Median.Age           1 0.032909 0.032909 274.834 < 2.2e-16 ***
## Median.Income        1 0.008848 0.008848  73.898 < 2.2e-16 ***
## Insurance_Percent    1 0.004489 0.004489  37.488 1.062e-09 ***
## Male_Percent         1 0.009000 0.009000  75.160 < 2.2e-16 ***
## Residuals         2553 0.305695 0.000120                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We will take this model as our model.

Death.Factor.lm=Death.Factor.lm2

==> (3) Median Age, Median Income, Insurance Percent and Male Percent has significant affect on Death Rate.

3. MULTICOLLINEARITY

detach("package:dplyr", unload=TRUE)
library(car)
vif(Death.Factor.lm)
##        Median.Age     Median.Income Insurance_Percent      Male_Percent 
##          1.031513          1.096595          2.907290          2.807764

4. INFLUENTIAL OBSERVATIONS

First, we are looking for high leverage points.

finaldata.noNA = na.omit(finaldata)
X<-cbind(rep(1,length(finaldata.noNA$Death_Rate)), finaldata.noNA$Median.Age, finaldata.noNA$Median.Income, finaldata.noNA$Male_Percent)
hii<-diag(X%*%solve(t(X)%*%X)%*%t(X),)
p<-ncol(X) 
n<-nrow(X) 
which(hii>2*p/n) 
##   [1]   52   59   66   67   68   69   72   74   75   77   84  129  130  163  166
##  [16]  175  177  194  197  199  219  224  226  228  231  237  250  262  263  267
##  [31]  272  275  276  277  285  287  292  309  311  313  314  316  333  336  345
##  [46]  370  372  382  395  401  408  428  438  441  443  461  464  481  502  519
##  [61]  542  546  619  756  786  790  813  820  855  857  862  901  903  908  916
##  [76]  921  948  962  966  974 1015 1017 1019 1021 1023 1025 1026 1028 1030 1031
##  [91] 1038 1045 1046 1047 1051 1052 1057 1084 1085 1095 1106 1112 1117 1137 1192
## [106] 1201 1212 1215 1221 1226 1240 1255 1263 1270 1271 1301 1311 1341 1352 1403
## [121] 1414 1443 1477 1479 1482 1490 1492 1493 1494 1498 1499 1512 1514 1523 1557
## [136] 1567 1570 1578 1586 1652 1671 1693 1704 1709 1725 1765 1792 1815 1843 1867
## [151] 1899 1904 1916 1933 1987 2002 2007 2029 2052 2062 2069 2084 2087 2095 2111
## [166] 2114 2128 2132 2138 2141 2146 2147 2152 2162 2166 2173 2188 2193 2195 2198
## [181] 2208 2224 2228 2233 2242 2244 2251 2254 2266 2269 2286 2287 2293 2296 2298
## [196] 2301 2303 2305 2307 2311 2314 2318 2323 2328 2338 2346 2348 2353 2354 2355
## [211] 2358 2360 2361 2365 2367 2373 2388 2400 2409 2421 2464 2527 2529
library(olsrr)
ols_plot_cooksd_chart(Death.Factor.lm)

5. RESIDUAL ANALYSIS

hist(Death.Factor.lm$residuals)

plot(density(Death.Factor.lm$residuals))

qqnorm(Death.Factor.lm$residuals)

linear.student.resid = rstudent(Death.Factor.lm)
plot(Death.Factor.lm$fitted,linear.student.resid)
title("Studentized residuals versus predicted")

The residual histogram shows a high frequency (>1500) residual points from -0.01 to 0.00. This makes the residual skewed to the left, and there are more lower value than predicted. The QQ Plot also shows that the linear doesn’t follow a Normal distribution

==> We will try to transform the the data y=ln(y)

Variance stabilizing transformations

ln.Death_Rate=log(finaldata$Death_Rate)
linear.ln.Death = lm(ln.Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent+Poverty_Percent, data=finaldata)
plot(linear.ln.Death)

linear.ln.Death.student.resid = rstudent(linear.ln.Death)
plot(linear.ln.Death$fitted.values, linear.ln.Death.student.resid)
title("Studentized residuals versus fittled values")

hist(linear.ln.Death$residuals)

plot(density(linear.ln.Death$residuals))

qqnorm(linear.ln.Death$residuals)

With the data transformed from y to ln(y):
- Residuals vs Fitted: The residuals are more spread (doesnot have any patterns). The model looks more linear.
- Normal QQ Plot, Histogram and Density: The residual follow closer to Normal Distribution.
- Cook’s Distance Chart: If using threshold=0.05, there will be 2 leverage points 428 and 2365.

==> Log-transforming worked to stabilize the variance.

CONCLUSION

From our data analysis, we have the results:

  1. There is a positive relationship between the Mortality Rate and Incidence Rate.
  2. Poverty is not a significant to Mortality Rate.
  3. Median Age, Median Income, Insurance Percent and Male Percent has significant affect on Death Rate.
  4. There is no multicollinearity between the regressors.

In conclusion, we find that Incidence Rate,Median Age, Median Income, Insurance Percent and Male Percent affects the Lung Cancer Mortality Rate for counties in the USA. However, there can still be improvements to the model, recommend to look for extra factors. The data observed needs more investigation. Maybe separate based on some criteria (small, large population) to fit each model better.

APPENDIX

https://github.com/annalpapham/STAT350-Final-Project